Presented by: Liyan Tan, Yequan Zhao
24'Spring UCSB ECE 278A Digital Image Processing
Work distribution:
Estimate the three-dimensional structure of a scene by analyzing the motion of the camera and the changes across multiple images.
Mapping and Navigation: creating maps and navigating through environments.
Triangulation is a method used to determine the three-dimensional position of a point $P$ by measuring its projections $p$, $p'$ from two distinct camera views $O_1$ and $O_2$.
The projection of a real world point $P$ to the digital image plane is:
$$p = K \begin{bmatrix} I_3 & 0 \end{bmatrix} \begin{bmatrix} R & T \\ 0 & 1 \end{bmatrix} P = M P $$where:
Before:
Now:
$\color{#EF5645}{\text{Structure from Motion}}$ is a class of methods that simultaneously determine both the 3D structure of the scene and the parameters of the cameras.
$\color{#EF5645}{\text{Remark}}$: It is called "motion", because this can also be from one camera moving around an object.
$\color{#EF5645}{\text{Notations}}$: We have:
We are given:
We want to recover:
Depending on the class of the projection matrices, we get different SfM:
$\color{#EF5645}{\text{Given:}}$ Image points $x_{ij}$ representing $X_j$ in camera $i$ with projection matrix $M_i$:
$$x_{ij} = M_i X_j = A_i X_j + b_i$$$\color{#EF5645}{\text{Goal:}}$ Recover:
$\color{#EF5645}{\text{Given:}}$
$\color{#EF5645}{\text{Goal:}}$ Recover:
$\rightarrow$ there is a minimum number of $2D$ images that we need to solve this problem.
The SfM problem can be solved if we have at least enough equations for the number of unknowns, i.e. $$2 mn \geq 8m + 3n.$$
$\color{#047C91}{\text{Example}}$:
$\color{#EF5645}{\text{The Tomasi and Kanade Factorization Method}}$ is a method for solving the affine SfM.
It consists of two steps:
Our original equations are: $x_{ij} = \begin{bmatrix} A_i & b_i \end{bmatrix} X_j$, where:
$\color{#EF5645}{\text{The data centering step}}$ helps focusing the problem on first estimating:
... by transforming these original equations.
In practice, this step centers the set of 2D image points.
For each image $i$,
So that we transform our datasets of $x_{ij}$ into a dataset of $\hat x_{ij}$.
This gives:
$$\begin{align*} \hat x_{ij} &= x_{ij} - \frac{1}{n}\sum_{k=1}^n x_{ik} \\ &= A_i X_j + b_i - \frac{1}{n}\sum_{k=1}^n (A_iX_k + b_i)\\ &= A_i X_j - \frac{1}{n}\sum_{k=1}^n A_iX_k\\ &= A_i (X_j - \frac{1}{n}\sum_{k=1}^n X_k)\\ &=A_i(X_j - \bar X)\\ &=A_i\hat X_j \end{align*} $$Thus, our new equations are: $\hat x_{ij} = A_i \hat X_j$, where:
$ \rightarrow$ we eliminated the estimation of the translations $b_i$.
Goal: Factor out $A_i$ and $\hat X_j$, for all $i$ and $j$.
$\rightarrow$ Gather everything in a big matrix.
$\color{#EF5645}{\text{The measurement matrix}}$ $D$ is defined as: $$D = \begin{bmatrix} \hat x_{11} & \hat x_{12} & ... & \hat x_{1n}\\ \hat x_{21} & \hat x_{22} & ... & \hat x_{2n}\\ ... & \hat ... & ... & ...\\ \hat x_{m1} & \hat x_{m2} & ... & \hat x_{mn}\\ \end{bmatrix} $$ where each $\hat x_{ij}$ is a $2 \times 1$ vector.
This $D$ is of shape $2m \times n$.
Thanks to the data centering step, the measurement matrix $D$ can be reformulated as: $$D = \begin{bmatrix} A_1 \hat X_1 & A_1 \hat X_2 & ... & A_1 \hat X_n\\ A_2 \hat X_1 & A_2 \hat X_2 & ... & A_2 \hat X_n\\ ... & ... & ... & ...\\ A_m \hat X_1 & A_m \hat X_2 & ... & A_m \hat X_n\\ \end{bmatrix} $$
which we write: $$D = MS$$
In the equation $D = MS$ we have:
Specifically, $M$ is the $2m \times 3$ matrix: $$M = \begin{bmatrix} A_1 \\ ... \\ A_m \end{bmatrix} $$ since each matrix $A_i$ is $2 \times 3$.
In the equation $D = MS$ we have:
Specifically, $S$ is the $3 \times n$ matrix: $$S = \begin{bmatrix} \hat X_1 & ... & \hat X_n \end{bmatrix} $$ since each matrix $X_j$ is a 3D vector.
Thus, our equation writes: $D = MS = \begin{bmatrix} A_1 \\ ... \\ A_m \end{bmatrix} \begin{bmatrix} \hat X_1 & ... & \hat X_n \end{bmatrix} $ where:
$\color{#6D7D33}{\text{Property}}$: Written as the above product, we know that:
$\color{#EF5645}{\text{The Tomasi and Kanade Factorization Method}}$ is a method for solving the affine SfM.
It consists of two steps:
We will factorize matrix $D$ using singular value decomposition: $$D = U \Sigma V^T,$$ and will recover $M$ and $S$ from combinations of $U, \Sigma, V$.
Brief introduction of SVD:
Since $rank(D) = 3$:
In the SVD: $D = U_3 \Sigma_3 V_3^T$:
$\color{#EF5645}{\text{Remark:}}$ Unfortunately, in practice, rank(D) > 3 because of measurement noise and the affine camera approximation.
However, when rank(D) > 3, $U_3 \Sigma_3 V_3^T$ is still the best possible rank-3 approximation of $M S$ in the sense of the Frobenius norm.
In the SVD: $D = U_3 \Sigma_3 V_3^T$, we note that:
$\rightarrow$ Assign $S = \Sigma_3 V_3^T$ and $M = U_3$?
But what about:
$\rightarrow$ Assign $S = V_3^T$ and $M = U_3 \Sigma_3 $?
Answer in Tomasi and Kanada's paper:
We say that there is an ambiguity in the choice of factorization, because any $3 \times 3$ invertible matrix $A$ may be inserted into the decomposition: $$D = M AA^{−1}S = (M A)(A^{−1}S)$$ This means that:
$\rightarrow$ Need extra constraints to solve ambiguity.
When a reconstruction has affine ambiguity, it means that parallelism is preserved, but the metric scale is unknown.
The fact that there is no way to recover the absolute scale of a scene from images is fairly intuitive. An object’s scale, absolute position and canonical orientation will always be unknown unless we make further assumptions (e.g, we know the height of the house in the figure) or incorporate more data. This is because some attributes may compensate for others. For instance, to get the same image, we can simply move the object backwards and scale it accordingly.